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Abstract — The g-Gaussian distribution results from maximiz- 
ing certain generalizations of Shannon entropy under some 
constraints. The importance of g-Gaussian distributions stems 
from the fact that they exhibit power-law behavior, and also 
generalize Gaussian distributions. In this paper, we propose 
a Smoothed Functional (SF) scheme for gradient estimation 
using g-Gaussian distribution, and also propose an algorithm for 
optimization based on the above scheme. Convergence results 
of the algorithm are presented. Performance of the proposed 
algorithm is shown by simulation results on a queuing model. 

I. Introduction 

Stochastic optimization algorithms play an important role in 
optimization problems involving objective functions that can- 
not be computed analytically. These schemes are extensively 
used in discrete event systems, such as queuing systems, for 
obtaining optimal or near-optimal performance measures. 

Gradient descent algorithms are used for stochastic opti- 
mization by estimating the gradient of average cost in the long 
run. Methods for gradient estimation by random perturbation 
of parameters have been proposed in JT]. The Smoothed 
Functional scheme, described in Q, approximates the gradient 
of expected cost by its convolution with a multivariate normal 
distribution. Based on all the above schemes, two-timescale 
stochastic approximation algorithms have been presented in 
0, which simultaneously perform cost averaging and param- 
eter updation using different step-size schedules. 

In this paper, we propose a new SF method based on q- 
Gaussian distribution, which is a generalization of the normal 
distribution. We show that g-Gaussian satisfies all the condi- 
tions for smoothing kernels proposed by Rubinstein [4|. We 
illustrate a method for gradient estimation using g-Gaussian. 
We also present a two-timescale algorithm for stochastic 
optimization using g-Gaussian based SF, and show the con- 
vergence of the proposed algorithm. 

The rest of the paper is organized as follows. The framework 
for the optimization problem and some of the preliminaries 
are presented in Section [II] The gradient estimates using q- 
Gaussian SFs have been derived in Section |III] Section [TV] 
presents the proposed algorithm. Numerical experiments com- 
paring our algorithm with a previous algorithm is presented in 
SectionJV] An outline of convergence analysis of our algorithm 
is discussed in Section VI Finally, Section VII provides the 
concluding remarks. 



II. Background and Preliminaries 

A. q-Gaussian distribution 

Most of the distributions, like normal, uniform, exponential 
etc., can be obtained by maximizing Shannon entropy func- 
tional defined as H(p) = J p(x)lnp(x)dx, where p(.) is a pdf 

x 

defined on the sample space X. Other entropy functions have 
also been proposed as generalized information measures. One 
of the most popular among them is Tsallis entropy|5| defined 

as 

1 - J[p(x)] q dx 
H q (p) = ^— - , qeR. (1) 



Its consistency with the discrete case is shown in [6|, and 
the Shannon-Khinchin axioms for the discrete case in such a 
generalized setting is established in Q. This entropy function 
produces Shannon entropy as q — > 1. Corresponding to this 
generalized measure, g-expectation of a function /(.) can be 
defined as 

f f(x)\p(x)}«dx 
= R fipix^x ■ (2) 

E 

Maximizing Tsallis entropy under the following constraints: 



and 



(3) 



results in g-Gaussian distribution |8], which is of the following 
form: 
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where y + = max(y,0) is called Tsallis cut-off condition, and 
K q is the normalizing constant, which depends on the value 
of g. The function defined in Q is not integrable for q ^ 3, 
and hence, g-Gaussian is a probability density function only 
for g < 3. 

Multivariate form of the g-Gaussian distribution J9) is 
defined in the following way: 



q,N 



(1-g) 

(3-g)/? 2 



\x\ 



(5) 



where K Qi n is the normalizing constant. The multivariate 
normal distribution is a special case of |5]l as q —> 1. 



B. Problem Framework 

Let {Y n } ne m C M. d be a parameterized Markov process, 
depending on a tunable parameter 9 £ C, where C is a 
compact and convex subset of R N . Let Pg(x,dy) denote the 
transition kernel of {Y n } when the operative parameter is 
9 e C. Let h : R d ^ R + \J{0} be a Lipschitz continuous 
cost function associated with the process. 

Assumption I. The process {Y n } is ergodic for any given 9 
as the operative parameter, i.e., 

1 

2 h( - Y ^ ~> E -oHY)} as L > oo, 

m=0 

where vg is the stationary distribution of {Y n }. 

Our objective is to minimize the long-run average cost 



i r 

J{9) = lim - V h(Y m ) = / h(x)v e {da 

L— too L — ' I 

m=0 X, 



(6) 



by choosing an appropriate 9 g C. The existence of the above 
limit is given by Assumption [I] In addition, we assume that 
the average cost J{9) satisfies the following condition: 

Assumption II. J{9) is continuously differentiable with re- 
spect to any 9 € C. 

We also assume the existence of a stochastic Lyapunov 
function through the following assumption: 

Assumption III. Let {9(n)} be a sequence of random pa- 
rameters, obtained using an iterative scheme, controlling the 
process {Y n }, and J- n = a{9(m), Y m , m n), n denote 
the sequence of associated a-fields. 

There exists 6q > 0, JC C R d compact, and a continuous 
M. d -valued function V, with lim|| a .||_^ 00 V(x) — oo, such that 
under any non-anticipative {9(n)}, 

(i) sup n E[V(Y n ) 2 ] < oo and 

(ii) E[V{Y n+1 )\T n ] < V(Y n ) - eo. when Y n £ JC, n > 0. 

Assumption [II] is a technical requirement, whereas As- 
is used to show the stability of the scheme. 



Ill 



sumption 

As sumption HU\ will not be required, for instance, if the single- 



stage cost function h is bounded in addition. 

C. Smoothed Functionals 

Given any function / : C h-> R, its smoothed functional is 
defined as 



Sp[f(0)]= / Gp(r,)f(9-r,)d V = / G fj {9 - r,)f(r,)dr, 



(7) 

where Gp : R N i— > E is a kernel function. 

The idea behind using smoothed functionals is that if f(9) 
is not well-behaved, i.e., it has a fluctuating character, then 
Sp[f(8)] is better-behaved. This ensures that any optimization 
algorithm with objective function f(9) does not get stuck at 
any local minimum, but converges to the global minimum. The 



parameter (3 controls the degree of smoothness. Rubinstein H 
shows that the SF algorithm achieves these properties if the 
kernel function satisfies the following sufficient conditions: 

(PI) Gp{rj) = ^G(f), 



,(D „<2) 



where G(j ) = G x (f ) = G^, ^, . . . , ^). 
(P2) Gpirf) is piecewise differentiable in rj. 
(P3) Gp(rj) is a probability distribution function, 

i.e.,S p [f(9)}=E Gp{r , ) [f(e-r,)}. 
(P4) ]im^ G p (r]) = 5(ri), 

where S(rj) is the Dirac delta function. 
(PS) lim^ Sp[f(9)}=f(9). 

The normal distribution satisfies the above conditions, and 
has been used as a kernel by Katkovnik J2|. 

Based on a form of gradient estimator has been derived 
in 1 3 1 which is given by: 



M-l L-l 



V e [J(9)} 



]T Y, v(n)h(Y m 



(8) 



n— m— 



for large M, L and small (3. The process {Y m } is governed by 
parameter (9{n) + f3r/(n)), where 9{n) eCc is obtained 
through an iterative scheme. T)(n) is a iV-dimensional vector 
composed of i.i.d. Af(0, 1) -distributed random variables. 

III. (j-Gaussian for Smoothed Functionals 

Proposition 3.1. The q-Gaussian distribution satisfies the 
kernel properties (PI) - (P5) for all q < 3, q =/= 1. 

Proof: 

(PI) From (|5]), it is evident that G qt p(j)) = ~^G q ( ^ 



(P2) For 1< q < 3, (l - $=$p \\v\\ 2 ) > 0, for all r) G R N . 

1 

Hence, G q ^(q) 
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For q < 1, when ||?7|| 2 < ( - 3 , 1 ^ ? - ) ^ , we have 



(1-9) 

Nl 2 ) -0. 



(l-g) NtiI |2 

(3-g)/3 2 



So, (|9) holds. On the other hand, when ||?7|| 2 ^ (3 (1 q)f> 
we have ( I - ^^\\v\\ 



(1-9) ' 

^ 0, which implies 



G 9t p(rj) = and, V n G q ^(r]) = 0. 
Thus, Gq,«(ry) is differentiable for g > 1, and piecewise 
differentiable for q < 1, 
(P3) G q fi(r}) is a distribution for g < 3 and hence, 

WW] =E G , iiB(lj) [/(*-»/)]. 



(P4) As p -> 0, G 9ij9 (0) = 



oo. But, we have 
y G q ,p(r))dr] = 1 for g < 3. So, lim G q ,p(v) = <5 (77). 
(P5) It follows from dominated convergence theorem that 



limS^[/(0)]= / ljmG q ,p(r,)f(0-ri)dTi 



— OO 
OO 



5(r,)f(0-r ] )dr, = f(e). 



Our objective is to estimate VgJ(6) using the SF approach. 
The existence of VgJ(9) is due to Assumption [H] Now, 



V e J(#)= V^J(0) v<, 2) j(<?) 



7 (N) 



J(9) 



Let us define, f2„ 



\v\\ 2 <^ q f}iorq<l, 



and fl q = ~R N for 1 < q < 3. It is evident that Sl 9 is the 
support set for the g-Gaussian distribution with q-variance /3 2 . 
Define the SF for gradient of average cost as, 



D Q AJ(0)} = S q , [V^J(6) 



= I G q , (e-r))V n J(r,)dri 

R N 



It follows from integration by parts and the definition of fl q , 
D q , [J(e)} = Jv n G q A.ri)J(d-v)dv 



Substituting 77 = we have 



fjJ(9 + Pfj) 



^«/(6> + pfj) 



G q (r))dr] 



(10) 



We first state the following lemma which will be required 



to prove the result in Proposition 3.3 



Lemma 3.2. Let f : R N i— > K be a function defined over a 
standard q-Gaussian distributed random variable X £ M. , 

i.e.,{X) q = and (XX T ) q = Inxn, 



then, (f(X)) q 



A„ 



-E, 



G q (X) 



f(X) 



1 _ II V||2 

1 (3-g)II^H 



where A q = [(K q , N ) q 1 J RN [G q (x)] g dx], K q , N being the 
normalizing constant for N-variate q-Gaussian. 



Proof: From <j2j 

/ f(x)[G q (x)]"dx 

(f(X))q = r t n / mo , 

J [G g (a;)]9da; 

' f /(*) fl 
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Proposition 3.3. For a given q < 3, q ^ 1, as P — > 0, SF for 

the gradient converges to a scaled version of the gradient, 



i.e., 



DqAAQ)] 



2A 



-> as p 0. 



(3-9) 

Proof: For small /?, using Taylor series expansion, 

1 



J{6 + pfj) = J(6) + PtV e J{6) + ~0 z fV 2 e J{e)fj + o(f) 



By Lemma [3~2| 

D q Am\ = 



2A 9 
0(3 - g) 



fjJ{9 + Pfj) 



2A q 
0(3 - q) 



2A g 
(3-9) 
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V e J(9) as (3 -> 0. 



Thus. /), :. ./■:«: (i 

As a consequence of the Proposition |3.3| for large M and 
small (3, the form of gradient estimate suggested by (jT0j is 



V e [J(6)] 
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(ii) 



Using an approximation of |6]), for large i, we can write the 
above equation as 



V 9 [J(0)] 
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(12) 



where {F m } is governed by parameter (0(n) + Pfj(n)). 

However, since A g > 0, A q need not be explicitly deter- 
mined as estimating [A V# J (9)} instead of VgJ{9) does not 
affect the gradient descent approach. As a special case, for 
q = 1, we have A q = 1 from definition. Hence, we obtain the 
same form as in dSli. 



IV. Proposed Algorithms 

In this section, we propose a two-timescale algorithm cor- 
responding to the estimate obtained in ( |12) . 

The g-Gaussian distributed parameters (77) have been gen- 
erated in the algorithm using the method proposed in ifTUl . 
Let {a(n)}, {b(n)} be two step-size sequences satisfying 



Assumption IV. a(n) = o(b(n)), a(n) = b(n) = oo, 

00 00 
and a(n) 2 , b(n) 2 < 00. 

,i=0 ra=0 

For 9 = (#«,..., #W) T e r n , let r(0) = (r(0«), . . . , 

T(8^ N ^)) T represent the projection of 9 onto the set C. 
{Z^(n),i — l,...,JV} ne N are quantities used to estimate 
[A q Vg J (9)] via the recursions below. 

The q-SF Algorithm 

Fix M, L, q and (3. 
Set Z«(0) = 0,i = 1,...,N. 

Fix parameter vector 0(0) = (6> (1) (0), . . . , 6>W(0)) T . 
for n = to M - 1 do 

Generate i.i.d. standard q-Gaussian distributed random 
variables 7/W(n), . . . ,rj^ N \n) and set 
?y (n) : = (r,^\n),...,^ N ){n)) T . 
for m = to L — 1 do 

Generate the simulation Y n L^-m governed with pa- 
rameter (9[n) + j3rj(n)). 
for i = 1 to N do 

zW(nL + m + l) = (1 - b(n))Z^(nL + m) 



+b(n) 



end for 
end for 

for z = 1 to N do 

0«(ti + 1) = T (0«(n) - a{n)Z^{nL)). 
end for 

Set 9{n + 1) := (0«(n + 1), . . . , flW(n + 1)) T 
end for 

Output 9{M) := (^^^(M), . . . , 0^ (M)) T as the final 
parameter vector. 



V. Numerical Experiment 
A. Numerical Setting 

We consider a two-node network of M/G/l queues with 
feedback. This setting here is somewhat similar to that consid- 
ered in 0. Nodes 1 and 2 are fed with independent Poisson 
external arrival processes with rates Ai = 0.2 and A2 = 0.1, 
respectively. After departing from Node-1, customers enter 
Node-2. Once the service at Node-2 is completed, a cus- 
tomer either leaves the system with probability p = 0.4 or 
joins Node 1. The service time processes of the two nodes, 



{Sl l {9i)} n ^,i and {S 2 (9 2 )} n ^i, respectively, are defined as 

S n {9i) = U t {n) 



R; 



£=1,2, u>l, 

(13) 

where Ri = 10 and R 2 = 20 are constants. Here, U\{n) and 
U2 (n) are independent samples drawn from the uniform distri- 
bution on (0,1). The service time of each node depends on the 
Ni -dimensional tunable parameter vector 0$, whose individual 
components lie in a certain interval [(0$ )min, (^1 )max], 
j = 1,2, ...,Ni, i = 1,2. 9i{n) represents the nth update 
of the parameter vector at Node-z, and 9i represents the target 
parameter vector. 

The cost function is chosen to be the sum of the two queue 
lengths at any instant. For the cost to be minimum, S l n (9i) 
should be minimum, and hence, we should have 9i(n) = 9j, 
i = 1,2. We denote 6 = 



and 9 = 



3(1) n(Ni) n d N ^)\ c lip AT 

?f 3 , .., e[ Nl \e 2 , .., 9 { 2 N2) ) e R N , where N=N 1 +N 2 . 
In order to compare the performance of various algorithms, 
we consider the Euclidian distance between 9(n) and 9. as 
our performance measure. 

For the simulations, we use the following values of parameters: 

1) Ni = N 2 = 2, 

2) (el j) ) min = 0, {9\ 3) ) max = 5 for j = 1, ..,N U 1 = 1,2, 
which implies C = [0, 5] . 

3) 9W> (0) = 5, 9^ = 1 for j = 1, 2, . . . , N, 

4) M = 10000, L = 100, 

5) a(n) = 1/n, 6(n) = 1/n 2 / 3 . 

B. Simulation Results 

The simulations are performed by varying the parameters q 
and (3. For each case, the results are averaged over 20 inde- 
pendent trials. We compare the performance of our algorithm 
with the SF algorithm proposed in 0, which uses Gaussian 
smoothing. Table I presents the performance comparison for 
different values of q and j3. 





—■ SF Algorithm 
— q-SF (q = 0.8) 
—q-SF (q = 0.9) 









Number of Iterations 



Fig. 1: Convergence behavior of the algorithm for /3 = 0.25. 

It can be observed from the results that for very low 
values of /3, our algorithm performs better than Gaussian-SF 
algorithm for q > 1. However, for relatively high values of j3, 
better performance can be obtained with q < 1. It can also be 
seen that for higher values of q, the algorithm tends to become 
unstable for high /3. As per the observations, the best value of 
q is 0.9, which performs better than Gaussian in 63% cases. In 
fact, q — 0.9 also gives the best performance (least distance) 
among all values of q in most of the cases (50%). 
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Lemma 6.3. For a given q < 3, q ^ 1, as n 



oo, 



TABLE I: Performance (mean distance from optimum). 



VI. Sketch of Convergence Analysis 

Here, we give a sketch of the proof of convergence of the 
proposed algorithm. We just state the important results. The 
proofs will be given in a longer version of the paper. 

Let = o-(e^(k),rj^(k),Y k ,k > l,i = 1,...,N), 
I ^ 1 denote the er-fields generated by the above mentioned 
quantities, where flW(fc) — #W(n) and fj^(k) = rj^(n) for 
i = 1, . . . N, nL ^ k < (n + 1)L. Define {b(n)} n ^ such 
that b(n) = &([£]), where [x] is the integer part of x. Thus, 



oo 

E 

n=0 



b(n) = oo, b(n) 2 < oo and b(n) = o(b(n)). 

n=Q 



Using above notation, we can rewrite Step 9 of our algorithm, 
with p = nL + m, in the following way: 



zW(p + i) = (i_6(p))zW(p)+ 

f) {l) (p)h(Y p ) 
./3(l-7M"« 2 ) 
Define the sequences {M^> (p)} p ^i, i = 1, • • -N as 



Hp) 



(14) 



M^{ P ) = Y}{k) 



fj^(k)h{Y k ) 



k=l 



. Mi-tMII^HII 2 

V {l) (k)h(Y k ) 

Mi-g^ll^)!! 2 



J"(fc- 1) 



(15) 



Lemma 6.1. The sequences {M^ (p), T(p)} p ^i, i = 1, 
2, . . . 7Y are almost surely convergent martingale sequences. 

Consider the following ordinary differential equations: 

0(f) = 0, (16) 
Z(t)=D qj3 [J(9)}- Z(t). (17) 

Lemma 6.2. The sequence of updates {Z(p)} is uniformly 
bounded with probability 1. 



Z(nL)-D q ^[J(6(n))] 



with probability 1. 



The following corollary follows directly from Proposi- 



tion 3.3 and Lemma 6.3 by triangle inequality. 

Corollary 6.4. Given a particular q < 3, with probability 1, 

as n — > oo and j3 — > 0, 

2A n 



Z(nL) - 



-V o J(0) 







(3-9) 

Now, finally considering the ODE corresponding to the 
slowest timescale recursion: 

6(t) = f f- ? | A « T V 9 J(0(f))Y (18) 



where T(f(x)) := lira 
continuous function / : 
lie in the set S — e C 

for given 8 > 0, S s := |« 



(3-9) 

( r(x+ef(x}) 

°..v 



: r 

€ C 



for any bounded, 
N . The stable points of ( fT8| ) 

J(0(f))) = o}. 

\\d-9o\\<8,e eSy 



Let 



Theorem 6.5. Under Assumptions |77| - \IV\ given q < 3, q ^ 1 
and 8 > 0, 3/3q > such that for all f3 G (0, j3o], the sequence 
{9(n)} obtained using the q-SF algorithm converges to a point 
in S s with probability 1 as n — > oo. 

VII. Conclusion 

g-Gaussians exhibit power law behavior, which gives a 
better control over smoothing of functions as compared to 
normal distribution. We have extended the Gaussian smoothed 
functional gradient estimation approach to q-Gaussians, and 
developed an optimization algorithm based on this. We have 
also presented results illustrating that for some values of q, 
our algorithm performs better than the SF algorithm in ||3l . 
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